Module 1 - Lab 2

Word Embeddings & Neural Representations

Hands-on lab activity: Interacting with Textual Data in Jupyter and Colab.
Published: November 21, 2024

Modified: February 17, 2026

Lab Overview

This lab transitions you from count-based representations (Bag of Words) to learned semantic representations (word embeddings), which form the foundation of modern neural NLP and generative AI systems.

  • Bag-of-Words treats language as counts
  • Embeddings treat language as geometry
  • Neural networks learn representations, not rules

This lab explains how and why that shift happens.

Learning Objectives

After completing this lab, you should be able to:

  • Explain the limitations of Bag-of-Words
  • Describe how distributional semantics works
  • Understand how neural networks learn word meaning
  • Connect backpropagation to representation learning
  • Articulate why embeddings matter for business analytics

From Bag-of-Words to Distributed Meaning

Bag-of-Words assumes:

  • Words are independent
  • Order does not matter
  • Meaning ≈ frequency

This creates well-known failures:

  • Synonyms look unrelated
  • Negation is ignored
  • Context is lost

This violates a core linguistic principle:

“You shall know a word by the company it keeps.” — J.R. Firth

Embeddings operationalize this idea mathematically.
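The synonym and negation failures can be seen concretely with a toy Bag-of-Words sketch (made-up sentences, pure Python, no NLP library needed):

```python
from collections import Counter
import math

def bow(text):
    # Bag-of-Words: token counts only; order is discarded
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity over the union vocabulary
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Synonymous sentences share no tokens, so BoW sees zero similarity
s1 = bow("profits rose sharply")
s2 = bow("earnings increased rapidly")
print(cosine(s1, s2))  # → 0.0

# Negation: counts differ by one token, so BoW sees near-duplicates
s3 = bow("the firm is profitable")
s4 = bow("the firm is not profitable")
print(cosine(s3, s4))  # high, despite opposite meanings
```

Two sentences meaning the same thing score 0; two sentences meaning opposite things score close to 1. Embeddings fix exactly this.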

What Is a Word Embedding?

A word embedding is a dense numerical vector where:

  • Each dimension encodes latent semantic features
  • Similar words lie close in vector space
  • Relationships emerge geometrically

Instead of counting words:

  • We predict context
  • The prediction error updates vectors
  • Vectors move closer/farther based on usage

Meaning is not labeled; it is learned implicitly.
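A toy sketch of the geometry, using made-up 3-dimensional vectors (real embeddings have 100+ dimensions and are learned from data, not hand-written):

```python
import numpy as np

def cos(u, v):
    # Cosine similarity: 1 = same direction, 0 = unrelated
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: words used in similar contexts get similar vectors
profit   = np.array([0.9, 0.1, 0.0])
earnings = np.array([0.8, 0.2, 0.1])   # similar usage → nearby vector
lawsuit  = np.array([0.0, 0.1, 0.9])   # different contexts → distant vector

print(cos(profit, earnings))  # high: close in vector space
print(cos(profit, lawsuit))   # low: far apart
```

The numbers here are invented for illustration; the point is that similarity becomes a measurable geometric quantity rather than a count overlap.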

1 Loading SEC 10-K Text Data

  • Load raw text documents into memory for tokenization and modeling.
  • SEC 10-K filings:
    • Are long and unstructured
    • Contain domain-specific language
    • Are ideal for testing semantic models
  • The code below:
    • Locates all SEC 10-K text files
    • Verifies dataset availability
    • Ensures reproducibility across Colab sessions
from pathlib import Path

# Colab Specific: Adjust path to your Google Drive location
# DATA_DIR = Path("/content/drive/MyDrive/SEC-10K-2024")
# files = sorted(DATA_DIR.glob("*.txt"))

# Local drive: use this only if the data is on your local machine
DATA_DIR = Path("../../data/SEC-10K-2024")
files = sorted(DATA_DIR.glob("*.txt"))

len(files)

2 Minimal Tokenization

  • Tokenization converts raw text into atomic units (tokens) that models can process.
  • We intentionally keep this simple to:
    • Avoid hiding complexity
    • Emphasize representation learning
    • Focus on meaning, not preprocessing tricks
  • Lowercasing ensures consistency
  • Removing punctuation reduces noise
  • Filtering short tokens removes artifacts
import re

def tokenize(text):
    text = text.lower()                    # lowercase first, or the regex would delete uppercase letters
    text = re.sub(r"[^a-z\s]", " ", text)  # strip digits and punctuation
    return [t for t in text.split() if len(t) > 2]  # drop 1-2 character artifacts
  • Creates clean word sequences
  • Preserves semantic structure
  • Prepares data for neural learning
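A quick sanity check of the tokenizer on a made-up 10-K-style sentence (the function is reproduced here, with the lowercasing step, so the cell runs on its own):

```python
import re

def tokenize(text):
    text = text.lower()                    # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)  # strip digits and punctuation
    return [t for t in text.split() if len(t) > 2]  # drop 1-2 character artifacts

print(tokenize("In 2024, the Company's net revenue grew by 12%."))
# → ['the', 'company', 'net', 'revenue', 'grew']
```

Note what disappears: the year, the percentage, the apostrophe, and short tokens like "in" and "by". For representation learning this is acceptable noise reduction; for tasks that need numbers, it would not be.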

3 Training Word Embeddings (Word2Vec)

Word2Vec trains a shallow neural network that learns word vectors by:

  • Predicting nearby words (Skip-gram), or
  • Predicting a word from its context (CBOW)
The pipeline: one-hot word → embedding layer → context prediction → loss.

  • The embedding layer is the model. It:
    • Learns 100-dimensional word vectors
    • Uses local context (window=5)
    • Ignores rare/noisy terms
    • Trains via stochastic gradient descent
from gensim.models import Word2Vec

# Build the corpus: one token list per filing (uses tokenize() and files from above)
tokenized_docs = [tokenize(f.read_text(errors="ignore")) for f in files]

model = Word2Vec(
    sentences=tokenized_docs,  # list of token lists
    vector_size=100,           # dimensionality of the word vectors
    window=5,                  # local context window
    min_count=5,               # ignore rare/noisy terms
    workers=2                  # training threads; default sg=0 selects CBOW (sg=1 for Skip-gram)
)

4 Inspecting Learned Meaning

Once trained:

  • Distance = similarity
  • Direction = relationship
  • Arithmetic ≈ semantics

No rules were written. No labels were provided. Yet meaning emerges.

model.wv.most_similar("risk", topn=5)

This query reveals:

  • Which concepts the model associates with "risk"
  • How business language clusters
  • Whether learning aligns with intuition
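The "arithmetic ≈ semantics" claim can be sketched with hypothetical 2-dimensional vectors; real embeddings exhibit such analogies only approximately, and in gensim the equivalent query is `model.wv.most_similar(positive=[...], negative=[...])`:

```python
import numpy as np

# Hypothetical embeddings where the axes happen to encode (royalty, gender)
king  = np.array([1.0,  1.0])
queen = np.array([1.0, -1.0])
man   = np.array([0.0,  1.0])
woman = np.array([0.0, -1.0])

# Analogy via vector offsets: king - man + woman lands on queen
result = king - man + woman
print(np.allclose(result, queen))  # → True
```

Subtracting `man` removes the gender component, adding `woman` replaces it, and the royalty component is untouched. In a trained model the match is approximate, which is why the nearest neighbor is retrieved rather than an exact vector.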

5 How Does the Neural Network Learn Meaning?

5.1 Key Idea

The network is not told what words mean.

It:

  1. Makes a prediction
  2. Measures error
  3. Updates vectors

Meaning is the byproduct of optimization.


6 Backpropagation — Conceptual Explanation

6.1 What Backpropagation Does

Backpropagation computes:

How much should each parameter change to reduce error?

6.2 In Embeddings

  • Parameters = word vectors
  • Loss = incorrect context prediction
  • Gradient = direction to move vectors

Each training step nudges vectors into better semantic positions.
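These updates can be sketched for a single positive (center, context) pair trained with a logistic loss — a toy illustration in the spirit of Skip-gram with negative sampling, not gensim's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
center  = rng.normal(scale=0.1, size=5)  # vector for the center word
context = rng.normal(scale=0.1, size=5)  # vector for an observed context word
lr = 0.5

for _ in range(50):
    score = sigmoid(center @ context)  # predicted P(context | center)
    grad = score - 1.0                 # gradient of the loss for a positive pair
    # Backprop: each vector moves opposite its gradient (toward the other)
    center, context = center - lr * grad * context, context - lr * grad * center

# The pair now scores near 1: the vectors were nudged together
print(sigmoid(center @ context))  # → close to 1.0
```

Each step is exactly the sentence above: the gradient is the direction to move the vectors, and repeated nudges place co-occurring words close together. A full trainer also pushes *non*-co-occurring ("negative") pairs apart, which this sketch omits.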


7 Why This Matters for Generative AI

7.1 Conceptual Continuity

Large Language Models:

  • Still use embeddings
  • Still use backpropagation
  • Still optimize prediction error

What changes is:

  • Scale
  • Architecture depth
  • Training data size

The core idea remains identical.


8 Business Relevance

Representation    Business Value
--------------    ----------------------
Bag-of-Words      Audits, baselines
Embeddings        Similarity, clustering
Neural models     Prediction, generation

8.1 SEC 10-K Applications

  • Risk similarity detection
  • Peer benchmarking
  • Topic drift analysis
  • Early warning signals

9 Deliverables

Answer the following (conceptual, not code-heavy):

  1. Why do embeddings outperform Bag-of-Words for financial text?
  2. How does backpropagation enable semantic learning?
  3. Name one business analytics task improved by embeddings.

10 Conceptual Takeaway

Bag-of-Words counts language. Embeddings learn language.

This lab completes your transition from classical text analytics to neural NLP foundations.